Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
Today we will introduce an idea even more powerful than the Decision Tree model to solve classification and regression problems: the Random Forest model. We will rely on the R package {randomForest} to train Random Forests in both classification and regression contexts. In order to understand Random Forests we will introduce some important theoretical concepts: Bootstrap aggregating (a.k.a. Bagging), Out-of-bag (OOB) error, and Feature Bagging (a.k.a. the random subspace method or attribute bagging). We will see how these approaches in Machine Learning prevent overfitting in the training of a complex model like Random Forest.
install.packages('randomForest')
install.packages('glmnet')
Grab the HR_comma_sep.csv dataset from Kaggle and place it in your _data directory for this session. We will also use the Boston Housing dataset: BostonHousing.csv
dataDir <- paste0(getwd(), "/_data/")
library(tidyverse)
library(data.table)
library(randomForest)
library(ggrepel)
We begin by loading and inspecting the HR_comma_sep.csv dataset:
dataSet <- read.csv(paste0(getwd(), "/_data/HR_comma_sep.csv"),
                    header = TRUE,
                    check.names = TRUE,
                    stringsAsFactors = FALSE)
glimpse(dataSet)
Rows: 14,999
Columns: 10
$ satisfaction_level <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42, 0.45, 0.11, 0.84, 0.41, 0.36, 0.38, 0.45, 0.78, 0.45...
$ last_evaluation <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53, 0.54, 0.81, 0.92, 0.55, 0.56, 0.54, 0.47, 0.99, 0.51...
$ number_project <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2, 2, 2, 2, 4, 2, 5, 6, 2, 6, 2, 2, 5, 4, 2, 2, 2, 6, 2, 2, 2, 4, 6, 2, 2...
$ average_montly_hours <int> 157, 262, 272, 223, 159, 153, 247, 259, 224, 142, 135, 305, 234, 148, 137, 143, 160, 255, 160, 262, 282, 147, 30...
$ time_spend_company <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3, 3, 3, 3, 6, 3, 5, 4, 3, 4, 3, 3, 5, 5, 3, 3, 3, 4, 3, 3, 3, 6, 4, 3, 3...
$ Work_accident <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ left <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ sales <chr> "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sal...
$ salary <chr> "low", "medium", "medium", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "...
The task is to predict the value of left - whether the employee has left the company or not - from the remaining variables shown in the glimpse() output above.
There are three important concepts needed to understand how Random Forest builds upon Decision Trees:
Bootstrap aggregating (Bagging). We begin with some training set for our model, \(D\). Bagging generates \(m\) new training sets, \(D_i\), \(i = 1, 2, ..., m\), by randomly sampling from \(D\) uniformly and with replacement. The samples obtained in this way are known as bootstrap samples. In Random Forests, \(m\) simpler models - Decision Trees, precisely - are fitted for each \(D_i\) bootstrap sample.
Out-of-bag (OOB) error. Each time a new bootstrap sample \(D_i\) is produced, some data points remain out of the bag, are not used in model training, and form the OOB set (the Out-of-bag set). The OOB error is a prediction error (remember this concept from our introduction to cross-validation?) which is computed from the OOB set, where the prediction is obtained by averaging the response (in regression) or as a majority vote (in classification) from all the trees in the forest that were not trained on that particular OOB instance.
The Random Subspace Method (Feature Bagging). The Random Subspace Method is a method to control for the complexity of the decision trees grown in a Random Forest model. On each new split, only a randomly chosen subset of predictors is used to find the optimal split. The Random Forests algorithm, as we will see, has a control parameter that determines how many features are randomly selected to produce a new split.
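The Bagging and OOB ideas can be illustrated in a few lines of R (a toy sketch, separate from the HR analysis): sampling \(n\) observations with replacement leaves each observation out of the bag with probability \((1 - 1/n)^n\), which approaches \(e^{-1} \approx .368\) as \(n\) grows.

```r
set.seed(42)
n <- 10000
# - D is a toy training set: just the observation indices
D <- 1:n
# - one bootstrap sample D_i: draw n indices uniformly,
# - with replacement - some repeat, some never appear
D_i <- sample(D, size = n, replace = TRUE)
# - the observations that never appear form the OOB set:
oobSet <- setdiff(D, D_i)
# - the OOB fraction is close to exp(-1) ~ .368:
print(length(oobSet)/n)
print(exp(-1))
```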
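In {randomForest}, the Random Subspace Method is controlled by the mtry parameter: the number of predictors sampled at each split. If left unset, the package defaults to floor(sqrt(p)) for classification and max(floor(p/3), 1) for regression, where p is the number of predictors (see ?randomForest). For our nine HR predictors:

```r
p <- 9  # - number of predictors in the HR dataset
# - {randomForest} default mtry values:
mtry_classification <- floor(sqrt(p))
mtry_regression <- max(floor(p/3), 1)
print(c(mtry_classification, mtry_regression))
```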
We will fit a Random Forest model to the HR_comma_sep.csv dataset with randomForest::randomForest() in R. In spite of all the precautionary measures already built into the Random Forest model itself, we will still perform an additional outer k-fold cross-validation with 5 folds: better safe than sorry.
First, define the folds:
dataSet$ix <- sample(1:5, dim(dataSet)[1], replace = T)
table(dataSet$ix)
1 2 3 4 5
3099 2962 2997 2941 3000
Perform a 5-fold cross-validation for a set of Random Forests across the following control parameters:
ntree <- seq(250, 1000, by = 250)
mtry <- 1:(dim(dataSet)[2]-2)
# - mtry: we do not count `left` which is an outcome
# - and `ix` which is a fold index, so we have
# - dim(dataSet)[2]-2 predictors in the model.
# - start timer:
tstart <- Sys.time()
rfModels <- lapply(ntree, function(nt) {
# - lapply() across ntree
mtrycv <- lapply(mtry, function(mt) {
# - lapply() across mt
cv <- lapply(unique(dataSet$ix), function(fold) {
# - lapply across folds:
# - split training and test sets
testIx <- fold
trainIx <- setdiff(1:5, testIx)
trainSet <- dataSet %>%
dplyr::filter(ix %in% trainIx) %>%
dplyr::select(-ix)
testSet <- dataSet %>%
dplyr::filter(ix %in% testIx) %>%
dplyr::select(-ix)
# - `left` to factor for classification
# - w. randomForest()
trainSet$left <- as.factor(trainSet$left)
testSet$left <- as.factor(testSet$left)
# - Random Forest:
model <- randomForest::randomForest(formula = left ~ .,
data = trainSet,
ntree = nt,
mtry = mt
)
# - ROC analysis:
predictions <- predict(model,
newdata = testSet)
hit <- sum(ifelse(predictions == 1 & testSet$left == 1, 1, 0))
hit <- hit/sum(testSet$left == 1)
fa <- sum(ifelse(predictions == 1 & testSet$left == 0, 1, 0))
fa <- fa/sum(testSet$left == 0)
acc <- sum(predictions == testSet$left)
acc <- acc/length(testSet$left)
# - Output:
return(
data.frame(fold, hit, fa, acc)
)
})
# - collect results from all folds:
cv <- rbindlist(cv)
# - average ROC:
avg_roc <- data.frame(hit = mean(cv$hit),
fa = mean(cv$fa),
acc = mean(cv$acc),
mtry = mt)
return(avg_roc)
})
# - collect from all mtry values:
mtrycv <- rbindlist(mtrycv)
# - add ntree and out:
mtrycv$ntree <- nt
return(mtrycv)
})
# - collect all results
rfModels <- rbindlist(rfModels)
write.csv(rfModels,
paste0(getwd(), "/rfModels.csv"))
# - Report timing:
print(paste0("The estimation took: ",
difftime(Sys.time(), tstart, units = "mins"),
" minutes."))
[1] "The estimation took: 18.1753468314807 minutes."
print(rfModels)
Let’s inspect the CV results visually. Accuracy first:
rfModels$ntree <- factor(rfModels$ntree)
rfModels$mtry <- factor(rfModels$mtry)
ggplot(data = rfModels,
aes(x = mtry,
y = acc,
group = ntree,
color = ntree,
fill = ntree,
label = round(acc, 2))
) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Random Forests CV: Accuracy") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
Hit rate:
ggplot(data = rfModels,
aes(x = mtry,
y = hit,
group = ntree,
color = ntree,
fill = ntree,
label = round(acc, 2))
) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Random Forests CV: Hit Rate") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
False Alarm rate:
ggplot(data = rfModels,
aes(x = mtry,
y = fa,
group = ntree,
color = ntree,
fill = ntree,
label = round(acc, 2))
) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Random Forests CV: FA Rate") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
And pick the best model from the ROC analysis, maximizing the difference between the hit rate and the false-alarm rate (i.e. Youden's J statistic):
rfModels$diff <- rfModels$hit - rfModels$fa
w_best <- which.max(rfModels$diff)
rfModels[w_best, ]
And we can still refine the chosen model:
dataSet <- dataSet %>%
dplyr::select(-ix)
dataSet$left <- factor(dataSet$left)
optimal_model <- randomForest::randomForest(formula = left ~ .,
data = dataSet,
ntree = 250,
mtry = 4)
optimal_model$importance
MeanDecreaseGini
satisfaction_level 2058.254674
last_evaluation 638.526242
number_project 897.076403
average_montly_hours 753.194652
time_spend_company 976.836267
Work_accident 16.064299
promotion_last_5years 2.851143
sales 58.493822
salary 34.418528
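The same importance scores can be visualized directly with randomForest::varImpPlot() (a quick sketch, assuming the optimal_model object fitted above):

```r
# - variable importance plot for the fitted forest
# - (assumes `optimal_model` from above)
randomForest::varImpPlot(optimal_model,
                         main = "Variable Importance: CV optimal Random Forest")
```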
The cumulative OOB error up to the i-th tree can be found in the first column of the optimal_model$err.rate field:
head(optimal_model$err.rate)
OOB 0 1
[1,] 0.02388389 0.01576522 0.04924242
[2,] 0.02402065 0.01651917 0.04790982
[3,] 0.02308386 0.01634684 0.04456193
[4,] 0.02230216 0.01514036 0.04501501
[5,] 0.01942832 0.01242054 0.04175756
[6,] 0.01923214 0.01231666 0.04118174
Let’s take a closer look at it:
oobFrame <- as.data.frame(optimal_model$err.rate)
oobFrame$ntree <- 1:nrow(oobFrame)
ggplot(data = oobFrame,
aes(x = ntree,
y = OOB)) +
geom_path(size = .25) +
geom_point(size = 1.5) +
ggtitle("Cumulative OOB error from the CV optimal Random Forest model") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
We can observe how the cumulative OOB error stabilizes once a certain number of trees have been grown. The number of trees that minimizes the OOB error:
w_tree <- which.min(oobFrame$OOB)
print(w_tree)
[1] 198
And finally:
optimal_model <- randomForest::randomForest(formula = left ~ .,
data = dataSet,
ntree = w_tree,
mtry = 4
)
predictions <- predict(optimal_model,
newdata = dataSet)
hit <- sum(ifelse(predictions == 1 & dataSet$left == 1, 1, 0))
hit <- hit/sum(dataSet$left == 1)
fa <- sum(ifelse(predictions == 1 & dataSet$left == 0, 1, 0))
fa <- fa/sum(dataSet$left == 0)
acc <- sum(predictions == dataSet$left)
acc <- acc/length(dataSet$left)
print(paste0("Accuracy: ", acc, "; Hit rate: ", hit, "; FA rate: ", fa))
[1] "Accuracy: 0.999933328888593; Hit rate: 1; FA rate: 8.75043752187609e-05"
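The same figures can be recovered from a confusion matrix, which is often easier to read (a sketch, assuming the predictions and dataSet objects from above; note that these are in-sample predictions, so near-perfect performance is expected):

```r
# - confusion matrix: observed vs. predicted classes
# - (assumes `predictions` and `dataSet` from above)
confusion <- table(observed = dataSet$left,
                   predicted = predictions)
print(confusion)
# - accuracy is the proportion of the diagonal:
print(sum(diag(confusion))/sum(confusion))
```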
In {randomForest}, to obtain a Random Forest model for regression, simply do not declare the outcome to be a factor. We will use the Boston Housing dataset to demonstrate.
dataSet <- read.csv(paste0('_data/', 'BostonHousing.csv'),
header = T,
check.names = F,
stringsAsFactors = F)
head(dataSet)
Here are the variables:
The medv variable is the outcome.
Random Forest (no cross-validation this time):
rfRegModel <- randomForest::randomForest(formula = medv ~ .,
                                         data = dataSet,
                                         ntree = 1000,
                                         mtry = 7
                                         )
We can use the generic plot() function to assess how the MSE changes across ntree:
plot(rfRegModel)
predictions <- predict(rfRegModel,
                       newdata = dataSet)
predictFrame <- data.frame(predicted_medv = predictions,
observed_medv = dataSet$medv)
ggplot(data = predictFrame,
aes(x = predicted_medv,
y = observed_medv)) +
geom_smooth(method = "lm", size = .25, color = "red") +
geom_point(size = 1.5, color = "black") +
geom_point(size = .75, color = "white") +
ggtitle("Random Forest in Regression: Observed vs Predicted\nBoston Housing Dataset") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 8))
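The fit can also be summarized numerically, e.g. by the RMSE and the squared correlation between observed and predicted values (a sketch, assuming the predictFrame object from above; again, these are in-sample figures):

```r
# - numerical summary of the regression fit
# - (assumes `predictFrame` from above)
rmse <- sqrt(mean((predictFrame$predicted_medv -
                     predictFrame$observed_medv)^2))
r2 <- cor(predictFrame$predicted_medv,
          predictFrame$observed_medv)^2
print(paste0("RMSE: ", round(rmse, 3), "; R^2: ", round(r2, 3)))
```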
Bagging (Bootstrap Aggregation), from Corporate Finance Institute
A very basic introduction to Random Forests using R, from Oxford Protein Informatics Group
A gentle introduction to random forests using R, Eight to Late
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.